Data Overview¶

We will work on a dataset that contains information about bookings from City hotel and Resort hotel. The dataset can be found (https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib6)

Meal: Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)

Goals¶

In this project, we will analyze the dataset to find out the answers to the following questions:

  1. Home country of guests
  2. Guests' preference for meal
  3. Price variation over the year

Data Cleaning and Preparation¶

We will start by importing some useful libraries we need in this project

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [2]:
df=pd.read_csv('hotel_bookings.csv')

the dataset is loaded, now we will gather some basic information about thedataframe.

In [3]:
df.head()
Out[3]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 7/1/2015
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 7/1/2015
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 7/2/2015
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 7/2/2015
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 7/3/2015

5 rows × 32 columns

In [4]:
df['country'].unique()
Out[4]:
array(['PRT', 'GBR', 'USA', 'ESP', 'IRL', 'FRA', nan, 'ROU', 'NOR', 'OMN',
       'ARG', 'POL', 'DEU', 'BEL', 'CHE', 'CN', 'GRC', 'ITA', 'NLD',
       'DNK', 'RUS', 'SWE', 'AUS', 'EST', 'CZE', 'BRA', 'FIN', 'MOZ',
       'BWA', 'LUX', 'SVN', 'ALB', 'IND', 'CHN', 'MEX', 'MAR', 'UKR',
       'SMR', 'LVA', 'PRI', 'SRB', 'CHL', 'AUT', 'BLR', 'LTU', 'TUR',
       'ZAF', 'AGO', 'ISR', 'CYM', 'ZMB', 'CPV', 'ZWE', 'DZA', 'KOR',
       'CRI', 'HUN', 'ARE', 'TUN', 'JAM', 'HRV', 'HKG', 'IRN', 'GEO',
       'AND', 'GIB', 'URY', 'JEY', 'CAF', 'CYP', 'COL', 'GGY', 'KWT',
       'NGA', 'MDV', 'VEN', 'SVK', 'FJI', 'KAZ', 'PAK', 'IDN', 'LBN',
       'PHL', 'SEN', 'SYC', 'AZE', 'BHR', 'NZL', 'THA', 'DOM', 'MKD',
       'MYS', 'ARM', 'JPN', 'LKA', 'CUB', 'CMR', 'BIH', 'MUS', 'COM',
       'SUR', 'UGA', 'BGR', 'CIV', 'JOR', 'SYR', 'SGP', 'BDI', 'SAU',
       'VNM', 'PLW', 'QAT', 'EGY', 'PER', 'MLT', 'MWI', 'ECU', 'MDG',
       'ISL', 'UZB', 'NPL', 'BHS', 'MAC', 'TGO', 'TWN', 'DJI', 'STP',
       'KNA', 'ETH', 'IRQ', 'HND', 'RWA', 'KHM', 'MCO', 'BGD', 'IMN',
       'TJK', 'NIC', 'BEN', 'VGB', 'TZA', 'GAB', 'GHA', 'TMP', 'GLP',
       'KEN', 'LIE', 'GNB', 'MNE', 'UMI', 'MYT', 'FRO', 'MMR', 'PAN',
       'BFA', 'LBY', 'MLI', 'NAM', 'BOL', 'PRY', 'BRB', 'ABW', 'AIA',
       'SLV', 'DMA', 'PYF', 'GUY', 'LCA', 'ATA', 'GTM', 'ASM', 'MRT',
       'NCL', 'KIR', 'SDN', 'ATF', 'SLE', 'LAO'], dtype=object)

Country names are represented in the ISO format.

In [5]:
df.shape
Out[5]:
(119390, 32)
In [6]:
df.columns
Out[6]:
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')
In [7]:
df.isnull().values.any()
Out[7]:
True
In [8]:
df.isnull().sum()
Out[8]:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

We can see that the dataset has some kind of missing values.

it is clear that column "country" has 488, "agent" 16340 and "company" with the highest at 112593. It means that the booking is likely done without any agent and company involved.

In [9]:
df['adults'].unique()
Out[9]:
array([ 2,  1,  3,  4, 40, 26, 50, 27, 55,  0, 20,  6,  5, 10],
      dtype=int64)
In [10]:
df['children'].unique()
Out[10]:
array([ 0.,  1.,  2., 10.,  3., nan])
In [11]:
df['babies'].unique()
Out[11]:
array([ 0,  1,  2, 10,  9], dtype=int64)

It seems that here we have some dirtiness in data as "adults", "children", and "babies" cannot be zero at a same time.

In [12]:
filter=(df["adults"]==0) & (df["children"]==0) & (df["babies"]==0)
df[filter]
Out[12]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
2224 Resort Hotel 0 1 2015 October 41 6 0 3 0 ... No Deposit NaN 174.0 0 Transient-Party 0.00 0 0 Check-Out 10/6/2015
2409 Resort Hotel 0 0 2015 October 42 12 0 0 0 ... No Deposit NaN 174.0 0 Transient 0.00 0 0 Check-Out 10/12/2015
3181 Resort Hotel 0 36 2015 November 47 20 1 2 0 ... No Deposit 38.0 NaN 0 Transient-Party 0.00 0 0 Check-Out 11/23/2015
3684 Resort Hotel 0 165 2015 December 53 30 1 4 0 ... No Deposit 308.0 NaN 122 Transient-Party 0.00 0 0 Check-Out 1/4/2016
3708 Resort Hotel 0 165 2015 December 53 30 2 4 0 ... No Deposit 308.0 NaN 122 Transient-Party 0.00 0 0 Check-Out 1/5/2016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
115029 City Hotel 0 107 2017 June 26 27 0 3 0 ... No Deposit 7.0 NaN 0 Transient 100.80 0 0 Check-Out 6/30/2017
115091 City Hotel 0 1 2017 June 26 30 0 1 0 ... No Deposit NaN NaN 0 Transient 0.00 1 1 Check-Out 7/1/2017
116251 City Hotel 0 44 2017 July 28 15 1 1 0 ... No Deposit 425.0 NaN 0 Transient 73.80 0 0 Check-Out 7/17/2017
116534 City Hotel 0 2 2017 July 28 15 2 5 0 ... No Deposit 9.0 NaN 0 Transient-Party 22.86 0 1 Check-Out 7/22/2017
117087 City Hotel 0 170 2017 July 30 27 0 2 0 ... No Deposit 52.0 NaN 0 Transient 0.00 0 0 Check-Out 7/29/2017

180 rows × 32 columns

In [13]:
df[~filter]
Out[13]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.00 0 0 Check-Out 7/1/2015
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.00 0 0 Check-Out 7/1/2015
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.00 0 0 Check-Out 7/2/2015
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.00 0 0 Check-Out 7/2/2015
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.00 0 1 Check-Out 7/3/2015
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119385 City Hotel 0 23 2017 August 35 30 2 5 2 ... No Deposit 394.0 NaN 0 Transient 96.14 0 0 Check-Out 9/6/2017
119386 City Hotel 0 102 2017 August 35 31 2 5 3 ... No Deposit 9.0 NaN 0 Transient 225.43 0 2 Check-Out 9/7/2017
119387 City Hotel 0 34 2017 August 35 31 2 5 2 ... No Deposit 9.0 NaN 0 Transient 157.71 0 4 Check-Out 9/7/2017
119388 City Hotel 0 109 2017 August 35 31 2 5 2 ... No Deposit 89.0 NaN 0 Transient 104.40 0 0 Check-Out 9/7/2017
119389 City Hotel 0 205 2017 August 35 29 2 7 2 ... No Deposit 9.0 NaN 0 Transient 151.20 0 2 Check-Out 9/7/2017

119210 rows × 32 columns

In [14]:
data=df[~filter]
In [15]:
data.head()
Out[15]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 7/1/2015
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 7/1/2015
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 7/2/2015
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 7/2/2015
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 7/3/2015

5 rows × 32 columns

Now the data is clean and ready for analysis.

Exploratory Data Analysis¶

In [16]:
df['hotel'].unique()
Out[16]:
array(['Resort Hotel', 'City Hotel'], dtype=object)

Since we have two different hotels. We will separate them.

In [17]:
resort=data[(data['hotel']=="Resort Hotel") & (data["is_canceled"]==0)]
In [18]:
city=data[(data['hotel']=="City Hotel") & (data["is_canceled"]==0)]
In [19]:
resort.shape
Out[19]:
(28927, 32)
In [20]:
city.shape
Out[20]:
(46084, 32)
In [21]:
df = data.groupby('hotel')['hotel'].count()
sns.barplot(x=df.index, y=df)
Out[21]:
<AxesSubplot:xlabel='hotel', ylabel='hotel'>
In [22]:
country_data=data[data['is_canceled']==0]['country'].value_counts().reset_index()
country_data.columns=['country','No of guests']
country_data
Out[22]:
country No of guests
0 PRT 20977
1 GBR 9668
2 FRA 8468
3 ESP 6383
4 DEU 6067
... ... ...
160 BHR 1
161 DJI 1
162 MLI 1
163 NPL 1
164 FRO 1

165 rows × 2 columns

In [23]:
px.choropleth(country_data, locations=country_data['country'],
             color=country_data['No of guests'],
             hover_name=country_data['country'],
             title="Home country of guests")
In [24]:
data['meal'].unique()
Out[24]:
array(['BB', 'FB', 'HB', 'SC', 'Undefined'], dtype=object)
In [25]:
data['meal'].value_counts()
Out[25]:
BB           92236
HB           14458
SC           10549
Undefined     1169
FB             798
Name: meal, dtype: int64
In [26]:
data.groupby(['hotel','meal']).agg({'meal':'count'}).unstack()
Out[26]:
meal
meal BB FB HB SC Undefined
hotel
City Hotel 62233.0 44.0 6412.0 10474.0 NaN
Resort Hotel 30003.0 754.0 8046.0 75.0 1169.0
In [27]:
px.pie(data,names=data['meal'].value_counts().index,values=data['meal'].value_counts()
    )
In [28]:
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
In [29]:
df = data.groupby(['hotel', 'arrival_date_month'])['adr'].mean().reset_index()
df['arrival_date_month'] = pd.Categorical(df['arrival_date_month'], categories=months, ordered=True)
df.sort_values('arrival_date_month', inplace=True)
In [30]:
df
Out[30]:
hotel arrival_date_month adr
4 City Hotel January 82.754477
16 Resort Hotel January 49.507033
3 City Hotel February 85.327519
15 Resort Hotel February 55.189716
7 City Hotel March 92.973339
19 Resort Hotel March 57.554652
0 City Hotel April 111.397415
12 Resort Hotel April 77.849496
20 Resort Hotel May 78.758134
8 City Hotel May 121.764614
6 City Hotel June 119.186056
18 Resort Hotel June 110.481032
5 City Hotel July 110.945950
17 Resort Hotel July 155.181299
1 City Hotel August 114.857330
13 Resort Hotel August 186.790574
11 City Hotel September 110.120296
23 Resort Hotel September 93.252030
22 Resort Hotel October 62.132572
10 City Hotel October 100.119313
9 City Hotel November 88.372486
21 Resort Hotel November 48.313643
2 City Hotel December 89.209560
14 Resort Hotel December 69.051887
In [31]:
plt.figure(figsize=(9,6))
sns.lineplot(x='arrival_date_month', y='adr', hue='hotel', data=df)
plt.ylabel("Average daily price")
plt.xlabel("Months")
p = plt.xticks(rotation=30)
plt.title("Average Daily rate by months")
Out[31]:
Text(0.5, 1.0, 'Average Daily rate by months')

Conclusion¶

We can conclude that people from over the world say in these two hotels and most of guest are from Portugal and other European countries and 77.4% of bookings reserves are for Bed and Breakfast. We can see the mid-year months have higher daily prices as seen from the bookings for the high mid-year months.